123 research outputs found

MoNoise: Modeling Noise Using a Modular Normalization System

Authors: van der Goot Rob; van Noord Gertjan
Publication date: 01/01/2017

We propose MoNoise, a normalization model focused on generalizability and efficiency; it aims to be easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain, in our case from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state of the art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly differently. Comment: Source code: https://bitbucket.org/robvanderg/monois
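The generate-then-rank design described in this abstract can be illustrated with a minimal sketch. This is not the MoNoise code: the lookup list, vocabulary, module names, and weights below are all hypothetical, the word embeddings module is omitted, a hand-weighted linear score stands in for the random forest classifier, and a unigram frequency stands in for the N-gram features.

```python
from typing import Callable, Dict, List

# Hypothetical toy resources, invented for illustration only.
LOOKUP: Dict[str, List[str]] = {"u": ["you"], "2morrow": ["tomorrow"]}
VOCAB: Dict[str, int] = {"you": 50, "tomorrow": 20, "see": 30, "sea": 5, "se": 1}


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def original_module(word: str) -> List[str]:
    # Keeping the word unchanged is always a candidate.
    return [word]


def lookup_module(word: str) -> List[str]:
    # Static lookup list of known replacements.
    return LOOKUP.get(word, [])


def spelling_module(word: str) -> List[str]:
    # Crude spelling correction: vocabulary words one edit away.
    return [v for v in VOCAB if v != word and levenshtein(word, v) <= 1]


MODULES: Dict[str, Callable[[str], List[str]]] = {
    "original": original_module,
    "lookup": lookup_module,
    "spelling": spelling_module,
}

# Hand-picked weights; a trained classifier would learn these instead.
WEIGHTS: Dict[str, float] = {"original": 0.5, "lookup": 2.0, "spelling": 1.0}


def generate(word: str) -> Dict[str, Dict[str, float]]:
    """Collect candidates; each module that proposes one sets a feature."""
    candidates: Dict[str, Dict[str, float]] = {}
    for name, module in MODULES.items():
        for cand in module(word):
            candidates.setdefault(cand, {})[name] = 1.0
    return candidates


def rank(word: str) -> List[str]:
    """Order candidates by module features plus a unigram frequency feature."""
    candidates = generate(word)

    def score(cand: str) -> float:
        module_score = sum(WEIGHTS[m] * v for m, v in candidates[cand].items())
        return module_score + 0.01 * VOCAB.get(cand, 0)

    return sorted(candidates, key=lambda c: (-score(c), c))


if __name__ == "__main__":
    for token in ["c", "u", "2morrow", "se"]:
        print(token, "->", rank(token)[0])
```

Because every module only contributes candidates and features, a new module (for example an embeddings-based one) could be added to `MODULES` without changing the ranking step.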

arXiv.org e-Print Archive

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

Authors: Nissim Malvina; Plank Barbara; van der Goot Rob
Publication date: 01/01/2017

Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known about the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not add consistently beyond just word embedding layer initialization. The latter approach yields a tagging model that is competitive with a Twitter state-of-the-art tagger. Comment: In WNUT 201

arXiv.org e-Print Archive

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

An In-depth Analysis of the Effect of Lexical Normalization on the Dependency Parsing of Social Media

Author: van der Goot Rob
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/10/2019

Existing natural language processing systems have often been designed with standard texts in mind. However, when these tools are used on the substantially different texts from social media, their performance drops dramatically. One solution is to translate social media data to standard language before processing; this is also called normalization. It is well known that this improves performance for many natural language processing tasks on social media data. However, little is known about which types of normalization replacements have the most effect. Furthermore, it is unknown what the weaknesses of existing lexical normalization systems are in an extrinsic setting. In this paper, we analyze the effect of manual as well as automatic lexical normalization for dependency parsing. After our analysis, we conclude that for most categories, automatic normalization scores close to manually annotated normalization and that small annotation differences are important to take into consideration when exploiting normalization in a pipeline setup.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

The IT University of Copenhagen's Repository

Dissertations of the University of Groningen

Normalization and parsing algorithms for uncertain input

Author: van der Goot Rob Matthijs
Publication venue: 'University of Groningen Press'
Publication date: 01/01/2019

ARTS repository - University of Groningen

Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch Data

Authors: Braggaar Anouck; van der Goot Rob
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/04/2021

The IT University of Copenhagen's Repository

Lexical Normalization for Code-switched Data and its Effect on POS Tagging

Authors: van der Goot Rob; Çetinoğlu Özlem
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 31/01/2021

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models, and lead to a 5.4% relative performance increase for POS tagging as compared to unnormalized input.

arXiv.org e-Print Archive

The IT University of Copenhagen's Repository